Introduction

In today’s data-driven world, the ability to forecast accurately is crucial for strategic planning and decision-making. This project aims to harness the power of historical data to generate reliable forecasts for various series within a de-identified dataset. With a comprehensive dataset spanning 1622 periods, we will employ advanced data analysis and forecasting techniques to predict future trends for the next 140 periods.

This report is crafted to cater to a diverse audience, ranging from individuals with no background in data science to seasoned data scientists. Our goal is to present the analysis and findings in a clear, concise, and accessible manner. We will explain the methodologies used, the rationale behind their selection, and the step-by-step process of our analysis, ensuring transparency and comprehensibility for all readers.

The report is structured to balance technical rigor with simplicity, providing a thorough yet comprehensible narrative of our forecasting journey. We begin with an exploration of the data, followed by a detailed explanation of the forecasting methods applied, and conclude with the results and their implications. Visualizations and key insights will be highlighted to enhance understanding and readability.

Loading necessary Packages

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'zoo'
## 
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Data exploration

Read the excel file and extract the data

## Response [https://raw.githubusercontent.com/waheeb123/Data-624/main/Projects/Data%20Set%20for%20Class.xls]
##   Date: 2024-06-22 05:44
##   Status: 200
##   Content-Type: application/octet-stream
##   Size: 1.33 MB
## <ON DISK>  /var/folders/lt/23m_3s6j42vgpv1ss4d6nkgc0000gn/T//RtmpU0d2eF/file8ab3ae6ddb9.xls

using glimpse, so we can view the attributes of our dataset

## Rows: 10,572
## Columns: 7
## $ SeriesInd <dbl> 40669, 40669, 40669, 40669, 40669, 40669, 40670, 40670, 4067…
## $ category  <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "S01…
## $ Var01     <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.20000, …
## $ Var02     <dbl> 123432400, 60855800, 10369300, 39335700, 27809100, 16587400,…
## $ Var03     <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.88000, …
## $ Var05     <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.94000, …
## $ Var07     <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.10000, …

Dataset have 10,572 rows and 7 columns.

First 6 Rows of Data
SeriesInd category Var01 Var02 Var03 Var05 Var07
40669 S03 30.64286 123432400 30.34 30.49 30.57286
40669 S02 10.28000 60855800 10.05 10.17 10.28000
40669 S01 26.61000 10369300 25.89 26.20 26.01000
40669 S06 27.48000 39335700 26.82 27.02 27.32000
40669 S05 69.26000 27809100 68.19 68.72 69.15000
40669 S04 17.20000 16587400 16.88 16.94 17.10000

check if there are any missing values in the entire dataframe

## [1] TRUE

count the number of missing values in the entire dataframe

## [1] 4294

Data Preparation

Data Subsetting

In order to forecast each variable within each group, we needed to break the larger dataset into its individual groups so we could perform our analysis and properly visualize our data.

Plotting Data

We can see variation within this type of variable is considerably less than in the the other type. Additionally, it looks like there is no apparent seasonality. In many of the variables of this type of data, we did see long trends (either upward or downward) and cyclicity. In looking at all the plots for every variable, we determined that further analysis was necessary to determine trend and seasonal components of the data.

Data Preparation

Many of the time series algorithms, both for forecasting and visualization, require that there be no missing values in the data. With this requirement, we deemed it necessary to fill nulls early in our data prep process.

Data Cleaning

In several of the datasets, there are clear outliers that would strongly influence the models we built.We used the Interquartile Range (IQR). The IQR is a measure of statistical dispersion, or how spread out the values in a dataset are. The process involves identifying values that fall within 1.5 times the IQR below the first quartile (Q1) and above the third quartile (Q3). Values outside this range are considered outliers.

# Filtering outliers for each subset
subset_S01_clean_Var01 <- subset_S01 %>%
  filter(Var01 >= quantile(Var01, 0.25, na.rm = TRUE) - 1.5 * IQR(Var01, na.rm = TRUE) & 
           Var01 <= quantile(Var01, 0.75, na.rm = TRUE) + 1.5 * IQR(Var01, na.rm = TRUE))

subset_S01_clean_Var02 <- subset_S01 %>%
  filter(Var02 >= quantile(Var02, 0.25, na.rm = TRUE) - 1.5 * IQR(Var02, na.rm = TRUE) & 
           Var02 <= quantile(Var02, 0.75, na.rm = TRUE) + 1.5 * IQR(Var02, na.rm = TRUE))

subset_S02_clean_Var02 <- subset_S02 %>%
  filter(Var02 >= quantile(Var02, 0.25, na.rm = TRUE) - 1.5 * IQR(Var02, na.rm = TRUE) & 
           Var02 <= quantile(Var02, 0.75, na.rm = TRUE) + 1.5 * IQR(Var02, na.rm = TRUE))

subset_S02_clean_Var03 <- subset_S02 %>%
  filter(Var03 >= quantile(Var03, 0.25, na.rm = TRUE) - 1.5 * IQR(Var03, na.rm = TRUE) & 
           Var03 <= quantile(Var03, 0.75, na.rm = TRUE) + 1.5 * IQR(Var03, na.rm = TRUE))

subset_S03_clean_Var05 <- subset_S03 %>%
  filter(Var05 >= quantile(Var05, 0.25, na.rm = TRUE) - 1.5 * IQR(Var05, na.rm = TRUE) & 
           Var05 <= quantile(Var05, 0.75, na.rm = TRUE) + 1.5 * IQR(Var05, na.rm = TRUE))

subset_S03_clean_Var07 <- subset_S03 %>%
  filter(Var07 >= quantile(Var07, 0.25, na.rm = TRUE) - 1.5 * IQR(Var07, na.rm = TRUE) & 
           Var07 <= quantile(Var07, 0.75, na.rm = TRUE) + 1.5 * IQR(Var07, na.rm = TRUE))

subset_S04_clean_Var01 <- subset_S04 %>%
  filter(Var01 >= quantile(Var01, 0.25, na.rm = TRUE) - 1.5 * IQR(Var01, na.rm = TRUE) & 
           Var01 <= quantile(Var01, 0.75, na.rm = TRUE) + 1.5 * IQR(Var01, na.rm = TRUE))

subset_S04_clean_Var02 <- subset_S04 %>%
  filter(Var02 >= quantile(Var02, 0.25, na.rm = TRUE) - 1.5 * IQR(Var02, na.rm = TRUE) & 
           Var02 <= quantile(Var02, 0.75, na.rm = TRUE) + 1.5 * IQR(Var02, na.rm = TRUE))

subset_S05_clean_Var02 <- subset_S05 %>%
  filter(Var02 >= quantile(Var02, 0.25, na.rm = TRUE) - 1.5 * IQR(Var02, na.rm = TRUE) & 
           Var02 <= quantile(Var02, 0.75, na.rm = TRUE) + 1.5 * IQR(Var02, na.rm = TRUE))

subset_S05_clean_Var03 <- subset_S05 %>%
  filter(Var03 >= quantile(Var03, 0.25, na.rm = TRUE) - 1.5 * IQR(Var03, na.rm = TRUE) & 
           Var03 <= quantile(Var03, 0.75, na.rm = TRUE) + 1.5 * IQR(Var03, na.rm = TRUE))

subset_S06_clean_Var05 <- subset_S06 %>%
  filter(Var05 >= quantile(Var05, 0.25, na.rm = TRUE) - 1.5 * IQR(Var05, na.rm = TRUE) & 
           Var05 <= quantile(Var05, 0.75, na.rm = TRUE) + 1.5 * IQR(Var05, na.rm = TRUE))

subset_S06_clean_Var07 <- subset_S06 %>%
  filter(Var07 >= quantile(Var07, 0.25, na.rm = TRUE) - 1.5 * IQR(Var07, na.rm = TRUE) & 
           Var07 <= quantile(Var07, 0.75, na.rm = TRUE) + 1.5 * IQR(Var07, na.rm = TRUE))

Data Imputation

In data preparation, imputation of missing values is a critical step. Each of the variables provided had missing values within the data and many approaches for filling them seemed appropriate. The approaches we took differed depending on if we were imputing for the first or second type of data. For the data that had large variations from day to day (that looked almost like white noise) we deemed taking an average appropriate. For the more stable datasets, we decided to use linear interpolation to fill the missing values with the previous value in the dataset. This seemed to make sense as each point in this datasets never seemed to be very far away from the previous one.

# Impute missing values using linear interpolation for subsets
subset_S06_clean_Var07$Var05 <- na.approx(subset_S06_clean_Var07$Var05)
subset_S06_clean_Var07$Var07 <- na.approx(subset_S06_clean_Var07$Var07)
subset_S05_clean_Var02$Var03 <- na.approx(subset_S05_clean_Var02$Var03)
subset_S05_clean_Var02$Var02 <- na.approx(subset_S05_clean_Var02$Var02)
subset_S04_clean_Var01$Var02 <- na.approx(subset_S04_clean_Var01$Var02)
subset_S04_clean_Var01$Var01 <- na.approx(subset_S04_clean_Var01$Var01)
subset_S03_clean_Var07$Var05 <- na.approx(subset_S03_clean_Var07$Var05)
subset_S03_clean_Var07$Var07 <- na.approx(subset_S03_clean_Var07$Var07)
subset_S02_clean_Var02$Var03 <- na.approx(subset_S02_clean_Var02$Var03)
subset_S02_clean_Var02$Var02 <- na.approx(subset_S02_clean_Var02$Var02)
subset_S01_clean_Var01$Var02 <- na.approx(subset_S01_clean_Var01$Var02)
subset_S01_clean_Var01$Var01 <- na.approx(subset_S01_clean_Var01$Var01)

# Find the last observation index for subsets
last_observation_index_S06 <- max(which(!is.na(subset_S06_clean_Var07$Var05)))
last_observation_index_S05 <- max(which(!is.na(subset_S05_clean_Var02$Var03)))
last_observation_index_S04 <- max(which(!is.na(subset_S04_clean_Var01$Var02)))
last_observation_index_S03 <- max(which(!is.na(subset_S03_clean_Var07$Var05)))
last_observation_index_S02 <- max(which(!is.na(subset_S02_clean_Var02$Var03)))
last_observation_index_S01 <- max(which(!is.na(subset_S01_clean_Var01$Var02)))

# Create time series objects for forecasting
ts_S06_Var05 <- ts(subset_S06_clean_Var07$Var05[1:last_observation_index_S06])
ts_S06_Var07 <- ts(subset_S06_clean_Var07$Var07[1:last_observation_index_S06])
ts_S05_Var02 <- ts(subset_S05_clean_Var02$Var02[1:last_observation_index_S05])
ts_S05_Var03 <- ts(subset_S05_clean_Var02$Var03[1:last_observation_index_S05])
ts_S04_Var01 <- ts(subset_S04_clean_Var01$Var01[1:last_observation_index_S04])
ts_S04_Var02 <- ts(subset_S04_clean_Var01$Var02[1:last_observation_index_S04])
ts_S03_Var05 <- ts(subset_S03_clean_Var07$Var05[1:last_observation_index_S03])
ts_S03_Var07 <- ts(subset_S03_clean_Var07$Var07[1:last_observation_index_S03])
ts_S02_Var02 <- ts(subset_S02_clean_Var02$Var02[1:last_observation_index_S02])
ts_S02_Var03 <- ts(subset_S02_clean_Var02$Var03[1:last_observation_index_S02])
ts_S01_Var01 <- ts(subset_S01_clean_Var01$Var01[1:last_observation_index_S01])
ts_S01_Var02 <- ts(subset_S01_clean_Var01$Var02[1:last_observation_index_S01])

Time Series Forecasting

# Forecast using auto.arima
forecast_S06_Var05 <- forecast(auto.arima(ts_S06_Var05), h = 140)
forecast_S06_Var07 <- forecast(auto.arima(ts_S06_Var07), h = 140)
forecast_S05_Var02 <- forecast(auto.arima(ts_S05_Var02), h = 140)
forecast_S05_Var03 <- forecast(auto.arima(ts_S05_Var03), h = 140)
forecast_S04_Var01 <- forecast(auto.arima(ts_S04_Var01), h = 140)
forecast_S04_Var02 <- forecast(auto.arima(ts_S04_Var02), h = 140)
forecast_S03_Var05 <- forecast(auto.arima(ts_S03_Var05), h = 140)
forecast_S03_Var07 <- forecast(auto.arima(ts_S03_Var07), h = 140)
forecast_S02_Var02 <- forecast(auto.arima(ts_S02_Var02), h = 140)
forecast_S02_Var03 <- forecast(auto.arima(ts_S02_Var03), h = 140)
forecast_S01_Var01 <- forecast(auto.arima(ts_S01_Var01), h = 140)
forecast_S01_Var02 <- forecast(auto.arima(ts_S01_Var02), h = 140)

# Create dataframe for forecasts
forecasts_df_S06 <- data.frame(
  SeriesInd = (subset_S06_clean_Var07$SeriesInd[last_observation_index_S06] + 1):(subset_S06_clean_Var07$SeriesInd[last_observation_index_S06] + 140),
  category = rep("S06", 140),
  Var05 = forecast_S06_Var05$mean,
  Var07 = forecast_S06_Var07$mean)
forecasts_df_S05 <- data.frame(
  SeriesInd = (subset_S05_clean_Var02$SeriesInd[last_observation_index_S05] + 1):(subset_S05_clean_Var02$SeriesInd[last_observation_index_S05] + 140),
  category = rep("S05", 140),
  Var02 = forecast_S05_Var02$mean,
  Var03 = forecast_S05_Var03$mean)
forecasts_df_S04 <- data.frame(
  SeriesInd = (subset_S04_clean_Var01$SeriesInd[last_observation_index_S04] + 1):(subset_S04_clean_Var01$SeriesInd[last_observation_index_S04] + 140),
  category = rep("S04", 140),
  Var01 = forecast_S04_Var01$mean,
  Var02 = forecast_S04_Var02$mean)
forecasts_df_S03 <- data.frame(
  SeriesInd = (subset_S03_clean_Var07$SeriesInd[last_observation_index_S03] + 1):(subset_S03_clean_Var07$SeriesInd[last_observation_index_S03] + 140),
  category = rep("S03", 140),
  Var05 = forecast_S03_Var05$mean,
  Var07 = forecast_S03_Var07$mean)
forecasts_df_S02 <- data.frame(
  SeriesInd = (subset_S02_clean_Var02$SeriesInd[last_observation_index_S02] + 1):(subset_S02_clean_Var02$SeriesInd[last_observation_index_S02] + 140),
  category = rep("S02", 140),
  Var02 = forecast_S02_Var02$mean,
  Var03 = forecast_S02_Var03$mean)
forecasts_df_S01 <- data.frame(
  SeriesInd = (subset_S01_clean_Var01$SeriesInd[last_observation_index_S01] + 1):(subset_S01_clean_Var01$SeriesInd[last_observation_index_S01] + 140),
  category = rep("S01", 140),
  Var01 = forecast_S01_Var01$mean,
  Var02 = forecast_S01_Var02$mean)

# Remove the last 140 rows from subset_S06 to append forecasts
n_rows_S06 <- nrow(subset_S06)
subset_S06 <- subset_S06[1:(n_rows_S06 - 140), ]
n_rows_S05 <- nrow(subset_S05)
subset_S05 <- subset_S05[1:(n_rows_S05 - 140), ]
n_rows_S04 <- nrow(subset_S04)
subset_S04 <- subset_S04[1:(n_rows_S04 - 140), ]
n_rows_S03 <- nrow(subset_S03)
subset_S03 <- subset_S03[1:(n_rows_S03 - 140), ]
n_rows_S02 <- nrow(subset_S02)
subset_S02 <- subset_S02[1:(n_rows_S02 - 140), ]
n_rows_S01 <- nrow(subset_S01)
subset_S01 <- subset_S01[1:(n_rows_S01 - 140), ]

Combine Original and Forecasted Data

# Combine original and forecasted data
combined_df_S06 <- rbind(subset_S06, forecasts_df_S06)
combined_df_S05 <- rbind(subset_S05, forecasts_df_S05)
combined_df_S04 <- rbind(subset_S04, forecasts_df_S04)
combined_df_S03 <- rbind(subset_S03, forecasts_df_S03)
combined_df_S02 <- rbind(subset_S02, forecasts_df_S02)
combined_df_S01 <- rbind(subset_S01, forecasts_df_S01)

# Add a label column to differentiate original and predicted data points
n_rows_combined_S06 <- nrow(combined_df_S06)
combined_df_S06$label <- "original"
combined_df_S06$label[(n_rows_combined_S06 - 139):n_rows_combined_S06] <- "predicted"
n_rows_combined_S05 <- nrow(combined_df_S05)
combined_df_S05$label <- "original"
combined_df_S05$label[(n_rows_combined_S05 - 139):n_rows_combined_S05] <- "predicted"
n_rows_combined_S04 <- nrow(combined_df_S04)
combined_df_S04$label <- "original"
combined_df_S04$label[(n_rows_combined_S04 - 139):n_rows_combined_S04] <- "predicted"
n_rows_combined_S03 <- nrow(combined_df_S03)
combined_df_S03$label <- "original"
combined_df_S03$label[(n_rows_combined_S03 - 139):n_rows_combined_S03] <- "predicted"
n_rows_combined_S02 <- nrow(combined_df_S02)
combined_df_S02$label <- "original"
combined_df_S02$label[(n_rows_combined_S02 - 139):n_rows_combined_S02] <- "predicted"
n_rows_combined_S01 <- nrow(combined_df_S01)
combined_df_S01$label <- "original"
combined_df_S01$label[(n_rows_combined_S01 - 139):n_rows_combined_S01] <- "predicted"

# Plot Var05 over SeriesInd for subset_S06
plot_SO6_var05 <- ggplot(combined_df_S06, aes(x = SeriesInd, y = Var05, color = label)) +
  geom_line() +
  labs(title = "Var05 over SeriesInd", x = "SeriesInd", y = "Var05") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_SO6_var07 <- ggplot(combined_df_S06, aes(x = SeriesInd, y = Var07, color = label)) +
  geom_line() +
  labs(title = "Var07 over SeriesInd", x = "SeriesInd", y = "Var07") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S05_Var02 <- ggplot(combined_df_S05, aes(x = SeriesInd, y = Var02, color = label)) +
  geom_line() +
  labs(title = "Var02 over SeriesInd - S05", x = "SeriesInd", y = "Var02") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S05_Var03 <- ggplot(combined_df_S05, aes(x = SeriesInd, y = Var03, color = label)) +
  geom_line() +
  labs(title = "Var03 over SeriesInd - S05", x = "SeriesInd", y = "Var03") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S04_Var01 <- ggplot(combined_df_S04, aes(x = SeriesInd, y = Var01, color = label)) +
  geom_line() +
  labs(title = "Var01 over SeriesInd - S04", x = "SeriesInd", y = "Var01") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S04_Var02 <- ggplot(combined_df_S04, aes(x = SeriesInd, y = Var02, color = label)) +
  geom_line() +
  labs(title = "Var02 over SeriesInd - S04", x = "SeriesInd", y = "Var02") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S03_Var05 <- ggplot(combined_df_S03, aes(x = SeriesInd, y = Var05, color = label)) +
  geom_line() +
  labs(title = "Var05 over SeriesInd - S03", x = "SeriesInd", y = "Var05") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S03_Var07 <- ggplot(combined_df_S03, aes(x = SeriesInd, y = Var07, color = label)) +
  geom_line() +
  labs(title = "Var07 over SeriesInd - S03", x = "SeriesInd", y = "Var07") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S02_Var02 <- ggplot(combined_df_S02, aes(x = SeriesInd, y = Var02, color = label)) +
  geom_line() +
  labs(title = "Var02 over SeriesInd - S02", x = "SeriesInd", y = "Var02") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S02_Var03 <- ggplot(combined_df_S02, aes(x = SeriesInd, y = Var03, color = label)) +
  geom_line() +
  labs(title = "Var03 over SeriesInd - S02", x = "SeriesInd", y = "Var03") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S01_Var01 <- ggplot(combined_df_S01, aes(x = SeriesInd, y = Var01, color = label)) +
  geom_line() +
  labs(title = "Var01 over SeriesInd - S01", x = "SeriesInd", y = "Var01") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()
plot_S01_Var02 <- ggplot(combined_df_S01, aes(x = SeriesInd, y = Var02, color = label)) +
  geom_line() +
  labs(title = "Var02 over SeriesInd - S01", x = "SeriesInd", y = "Var02") +
  scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
  theme_minimal()

Visualization of Forecasted Data

Conclusion

In this project we walked through a full time series analysis to understand a de-identified dataset and ultimately to generate forecasts for 12 individual variables.